A Machine Learning Framework for Combined Information Extraction and Integration

Authors

  • Fei Wu
  • Zaiqing Nie
  • Ji-Rong Wen
  • Wei-Ying Ma
Abstract

There are various kinds of objects embedded in static Web pages and online Web databases. Extracting and integrating these objects from the Web is of great significance for Web data management. Existing Web information extraction (IE) techniques cannot provide a satisfactory solution to the Web object extraction task, since objects of the same type are distributed across diverse Web sources whose structures are highly heterogeneous. Classic IE methods, which are designed for processing plain text documents, also fail to meet our requirements. In this paper, we propose a novel approach called Object-Level Information Extraction (OLIE) to extract Web objects. This approach extends a classic IE algorithm, Conditional Random Fields (CRF), by adding Web-specific information. It is essentially a combination of Web IE and classic IE. Specifically, visual information on the Web pages is used to select appropriate atomic elements for extraction and also to distinguish attributes, and structured information from external Web databases is applied to assist the extraction process. Experimental results show that OLIE can significantly improve Web object extraction accuracy.
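The idea of adding Web-specific information to a CRF-style sequence labeler can be illustrated with a minimal sketch. It is not the OLIE implementation: the `Element` fields, the feature names, the `KNOWN_AUTHORS` lookup set, and all weights below are hypothetical and hand-set purely for illustration, whereas a real CRF would learn its weights from labeled pages. The sketch scores each page element with visual features (font size, boldness) and a database-match feature, then finds the best label sequence with Viterbi decoding.

```python
from dataclasses import dataclass

LABELS = ["title", "author", "other"]

@dataclass
class Element:
    """One atomic page element with simple visual attributes (hypothetical)."""
    text: str
    font_size: int
    bold: bool

# Hypothetical stand-in for an external Web database of known author names.
KNOWN_AUTHORS = {"fei wu", "zaiqing nie"}

def features(el, label):
    """Return the (feature, label) pairs that fire for this element."""
    feats = []
    if el.font_size >= 18:
        feats.append(("large_font", label))       # visual feature
    if el.bold:
        feats.append(("bold", label))             # visual feature
    if el.text.lower() in KNOWN_AUTHORS:
        feats.append(("db_author_match", label))  # database feature
    if len(el.text.split()) > 4:
        feats.append(("long_text", label))        # plain-text feature
    return feats

# Hand-set weights for illustration only; a trained CRF would learn these.
STATE_W = {
    ("large_font", "title"): 2.0,
    ("bold", "title"): 0.5,
    ("long_text", "title"): 0.5,
    ("db_author_match", "author"): 3.0,
    ("long_text", "other"): 1.0,
}
TRANS_W = {
    ("title", "author"): 1.0,
    ("author", "author"): 0.5,
    ("author", "other"): 0.5,
}

def state_score(el, label):
    """Sum the weights of all features active for (element, label)."""
    return sum(STATE_W.get(f, 0.0) for f in features(el, label))

def viterbi(elements):
    """Return the highest-scoring label sequence for a list of elements."""
    # best[i][y]: best score of any labeling of elements[0..i] ending in y
    best = [{y: state_score(elements[0], y) for y in LABELS}]
    back = []
    for el in elements[1:]:
        scores, ptrs = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANS_W.get((p, y), 0.0))
            scores[y] = best[-1][prev] + TRANS_W.get((prev, y), 0.0) + state_score(el, y)
            ptrs[y] = prev
        best.append(scores)
        back.append(ptrs)
    # Follow back-pointers from the best final label.
    path = [max(LABELS, key=lambda y: best[-1][y])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]

page = [
    Element("A Machine Learning Framework for Combined Information Extraction and Integration", 20, True),
    Element("Fei Wu", 12, False),
    Element("Zaiqing Nie", 12, False),
    Element("There are various kinds of objects embedded in static Web pages", 10, False),
]
print(viterbi(page))  # ['title', 'author', 'author', 'other']
```

In this toy page, the large bold first element is labeled as the title, the two elements found in the lookup set are labeled as authors, and the long plain-text element falls back to `other`. The decoder only needs unnormalized scores, so the sketch omits the CRF's normalization term and training procedure.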

Similar resources

Evaluating machine learning methods and satellite images to estimate combined climatic indices

The reflections recorded on satellite images have been affected by various environmental factors. In these images, some of these factors are combined with other environmental factors that cannot be distinguished. Therefore, it seems wise to model these environmental phenomena in the form of hybrid indicators. In this regard, satellite imagery and machine learning methods can play a unique role ...

An Approach to Management of Health Care and Medical Diagnosis Using of a Hybrid Disease Diagnosis System

Introduction: To simplify information exchange within the medical diagnosis process, a collaborative software agent framework is presented. The purpose of the framework is to allow automated information exchange between different medical specialists. Methods: This study presents the architecture of a hybrid disease diagnosis system. The architecture employed a learning...

A Hierarchical Production Planning and Finite Scheduling Framework for Part Families in Flexible Job-shop (with a case study)

The tendency toward optimization in recent decades has resulted in the creation of multi-product manufacturing systems. Production planning in such systems is difficult, because the calculated optimal production volume must be consistent with the limitations of the production system. Hence, integration has been proposed to decide on these problems concurrently. The main problem in integration is how we can relate pro...

SKEL’s Research on Machine Learning for Information Integration on the Web

This short paper highlights the research activity of the Software and Knowledge Engineering Laboratory (SKEL) on machine learning methods to support information integration on the Web. SKEL belongs to the Institute of Informatics and Telecommunications of the National Centre for Scientific Research “Demokritos”, and has recently coordinated the European research project CROSSMARC, which produce...

Trust Classification in Social Networks Using Combined Machine Learning Algorithms and Fuzzy Logic

Social networks have become the main infrastructure of today’s daily activities of people during the last decade. In these networks, users interact with each other, share their interests on resources and present their opinions about these resources or spread their information. Since each user has a limited knowledge of other users and most of them are anonymous, the trust factor plays an import...



Publication date: 2004